Tree-based methods stratify or segment the predictor space into a number of simple regions1.
As the spliting rules to make these decision regions can be summerised in a tree structure, these approaches are called decision trees.
A decision tree can be thought of as breaking data down by asking a series of questions in order to categorise samples into the same class.
Root node: no incoming edge, zero, or more outgoing edges.
Internal node: one incoming edge, two (or more) outgoing edges.
Leaf node: each leaf node is assigned a class label if nodes are pure; otherwise, the class label is determined by majority vote.
Parent and child nodes: If a node is split, we refer to that given node as the parent node, and the resulting nodes are called child nodes.
Figure 1: Categorical Decision Tree
The "palmer penguins" dataset2 contains data for 344 penguins from 3 different species and from 3 islands in the Palmer Archipelago, Antarctica.
Artwork by @allison_horst
Figure 2: Penguin Buddies
| species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | |
|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
| 3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |
An algorithm starts at a tree root and then splits the data based on the features that give the best split based on a splitting criterion.
Generally this splitting procedure occours until3,4...
Below is an example of a very shallow decision tree where we have set max_depth = 1.
We can make the tree "deeper", and therefore more complex, by setting the max_depth = 3.
We could also use more than 2 features as seen below.
We could also also easily extend this to have more than a 2 (binary) class labels.
We can estimate the probability that an instance belongs to a particular class easily.
For our new observation we...
Example
Using the model in figure 7, we could find the probability of the following penguins species:
| bill_length_mm | flipper_length_mm | |
|---|---|---|
| 65 | 41.6 | 192.0 |
| Adelie | Gentoo | Chinstrap | |
|---|---|---|---|
| 0 | 0.0 | 0.0204 | 0.9796 |
Predicted Species is Chinstrap
In a general sense this approach is pretty simple, however there are a number of design choices and considerations we have to make including5:
Most decision tree algorithms address the following implimentation choices differently5:
There are a number of decision tree algorithms, prominant ones include:
Scikit-Learn uses an optimised version of the Classification And Regression Tree (CART) algorithm.
An algorithm starts at a tree root and then splits the data based on the feature, $f$, that gives the largest information gain, $IG$.
To split using information gain relies on calculating the difference between an impurity measure of a parent node, $D_p$, and the impurities of its child nodes, $D_j$; information gain being high when the sum of the impurity of the child nodes is low.
We can maximise the information gain at each split using,
$$IG(D_p,f) = I(D_p)-\sum^m_{j=1}\frac{N_j}{N_p}I(D_j),$$where $I$ is out impurity measure, $N_p$ is the total number of samples at the parent node, and $N_j$ is the number of samples in the $j$th child node.
Some algorithms, such as Scikit-learn's implimentation of CART, reduce the potential search space by implimenting binary trees:
$$IG(D_p,f) = I(D_p) - \frac{N_\text{left}}{N_p}I(D_\text{left})-\frac{N_\text{right}}{N_p}I(D_\text{right}).$$Three impurity measures that are commonly used in binary decision trees are the classification error ($I_E$), gini impurity ($I_G$), and entropy ($I_H$)4.
This is simply the fraction of the training observations in a region that does belongs to the most common class:
$$I_E = 1 - \max\left\{p(i|t)\right\}$$Here $p(i|t)$ is the proportion of the samples that belong to the $i$th class $c$, for node $t$.
For all non-empty classes ($p(i|t) \neq 0$), entropy is given by
$$I_H=-\sum^c_{i=1}p(i|t)\log_2p(i|t).$$The entropy is therefore 0 if all samples at the node belong to the same class and maximal if we have a uniform class distribution.
For example in binary classification ($c=2$):
Gini Impurity is an alternative measurement, which minimises the probabilty of misclassification,
$$ \begin{align} I_G(t) &= \sum^c_{i=1}p(i|t)(1-p(i|t)) \\ &= 1-\sum^c_{i=1}p(i|t)^2. \end{align} $$This measure is also maximal when classes are perfectly mixed (e.g. $c=2$):
$$ \begin{align} I_G(t) &= 1 - \sum^c_{i=1}0.5^2 = 0.5. \end{align} $$Decision trees allow us assess the importance of each feature for classifying the data,
$$ fi_j = \frac{\sum_{t \in s} ni_t}{\sum^m_t ni_t} $$where $ni_t$ is the $t$th nodes importance, and $s$ are the indices of nodes that split on feature $fi_j$.
We often assess the normalized total reduction of the criterion (e.g. Gini) brought by that feature,
$$ normfi_j = \frac{fi_j}{\sum^p_j fi_j}. $$Question: When do we stop growing a tree?
Occam’s razor: Favor a simpler hypothesis, because a simpler hypothesis that fits the data equally well is more likely or plausible than a complex one5.
To minimize overfitting, we can either set limits on the trees before building them (pre-pruning), or reduce the tree by removing branches that do not significantly contribute (post-pruning).
Digitized image of a fine needle aspirate of a breast mass. The features describe characteristics of the cell nuclei present in the image.
The dataset was created from digitized images of healthy (benign) and cancerous (malignant) tissues.
Image from Levenson et al. (2015), PLOS ONE, doi:10.1371/journal.pone.0141357.